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Abstract 

We present clustering methods for multivariate data exploiting the underlying 
geometry of the graphical structure between variables. As opposed to standard ap¬ 
proaches that assume known graph structures, we first estimate the edge structure of 
the unknown graph using Bayesian neighborhood selection approaches, wherein we 
account for the uncertainty of graphical structure learning through model-averaged 
estimates of the suitable parameters. Subsequently, we develop a nonparametric 
graph clustering model on the lower dimensional projections of the graph based on 
Laplacian embeddings using Dirichlet process mixture models. In contrast to stan¬ 
dard algorithmic approaches, this fully probabilistic approach allows incorporation of 
uncertainty in estimation and inference for both graph structure learning and clus¬ 
tering. More importantly, we formalize the arguments for Laplacian embeddings 
as suitable projections for graph clustering by providing theoretical support for the 
consistency of the eigenspace of the estimated graph Laplacians. We develop fast 
computational algorithms that allow our methods to scale to large number of nodes. 
Through extensive simulations we compare our clustering performance with standard 
clustering methods. We apply our methods to a novel pan-cancer proteomic data set, 
and evaluate protein networks and clusters across multiple different cancer types. 

Keywords: Dirichlet process mixture models. Graph clustering. Graph Laplacian, Graphical 
models, Proteomic data. Spectral clustering. 
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1 Introduction 


Clustering is one of the most widely studied approaches for investigating dependencies in 
multivariate data that arise in several domains, such as the biomedical and social sciences. 
Standard clustering approaches are based on metrics that define the similarity and/or 
dependence between variables. Examples are the use of linkage-based (e.g., hierarchical 
clustering), distance-based (Euclidean, Manhattan, Minkowski) or kernel-based metrics. 
In this article, we focus on a particular sub-class of clustering methods that are based on 
graphical dependencies between the variables, especially for data lying on a graph, or data 
that can be modeled probabilistically using a graphical model. In many complex systems, 
graphs or networks can effectively represent the inter-dependence between the major vari¬ 
ables of interest in various modeling contexts. Examples include protein-protein interaction 
networks in various cancer types, which may be modeled statistically to reveal the groups 
of proteins that play signihcant roles in different disease etiologies; social networks for users 
or communities that share common interests; and image networks, which help to identify 
similarly classified images. 

An important challenge of the investigation, when the number of nodes in a graph 
or network increases significantly, is to develop computationally efficient approaches for 
learning the network structure as well as inferring from the same, especially identifying 
the underlying clusters within the network. The investigation can also be addressed by 
identifying clusters of objects in the entire network, or as a graph partitioning problem, 
where the graph is partitioned into clusters such that the within-cluster connections are 
high and the between-cluster connections are low. Clusters of objects often help to reveal 
the underlying mechanism of the objects in the network with respect to the problem of 
interest. The graph partitioning problem is quite prevalent in the computer science and 
machine learning literature, where the main focus is on partitioning a given graph with 


known edge structures [see Von Luxburg (2007) and a comprehensive list of references 
therein]. 


From a modeling standpoint, graphical models (Lauritzen, 1996) are able to capture 
the conditional dependence structure through parametric (usually Gaussian) formulations 
on covariance matrices or precision (inverse covariance) matrices. Bayesian methods of 
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inference that use graphical models have received a great deal of attention recently. One 
approach uses prior specihcations on the space of sparse precision matrices. A conjugate 


family of priors, known as the G-Wishart prior (Roverato, 2000) has been developed for in¬ 
complete decomposable graphs. A more general family of conjugate priors for the precision 


matrix is the khp^-Wishart family of distributions; see Letac and Massam (2007); Rajarat- 


nam et al. (2008); Banerjee and Ghosal (2014). Bayesian methods of inference that use 


priors on elements of the precision matrix have also been proposed, e.g., Bayesian graphical 


lasso (Wang, 2012), shrinkage priors for precision matrices (Khondker et al., 2013 Wang 


and Pillai, 2013), and Bayesian graphical structure learning (|Banerjee and Ghosal 2015 


Wang, 2015). An alternative approach is to use regression models to estimate the preci¬ 


sion matrix, including neighborhood selection of the vertices of a graph (see Peng et al. 


(2009); Meinshausen and Biihlmann (2006)). Recent work by Kundu et al. (2013) estimates 
the graphical model by using a post-model htting neighborhood selection approach that is 
motivated by variable selection using penalized credible regions in the regression set-up. 


as introduced in Bondell and Reich (2012). Precision matrix estimation using Bayesian 
regression methods has been studied by Dobra et al. (2004); Bhadra and Malll^ (2013), 
where priors have been used on the regression coefficients, inducing sparsity. 

The above methods focus solely on graph structure learning. In this article, we focus on 
a broader problem of hrst learning the structure of the underlying graphical model and then 
learning the clusters using a valid probability model for the graph and the clusters within 
the graph. When the graphical structure is known, widely used algorithmic graph parti¬ 


tioning/ cutting methods can be used, including spectral graph clustering methods (Shi and 


Malik, 2000 Ng et al., 2002 Von Luxburg, 2007). Spectral methods serve to explore the 


geometry of the graphical structure and learn the underlying clusters by using the eigen¬ 
vectors of suitable matrices (graph Laplacians) obtained from the graph. These methods 


originated from primary works of Fiedler (1973) and Donath and Hoffman (1973). While 
these methods have attractive computational properties, the main drawbacks are that they 
assume known graph structures and, due to the heuristic/algorithmic nature of the formu¬ 
lation, do not allow for the explicit incorporation of uncertainty in estimation and inference 
for both graph structure learning and clustering. For unknown graphical structures, a suit- 
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able measure of pairwise similarities between the nodes is defined, which is then used to 
derive the graph Laplacian. This relates to nonlinear dimension reduction techniques or 
manifold learning such as diffusion maps (Coifman and Lafon, 2006| and Laplacian eigen- 


maps (Belkin and Niyogi, 2003). The eigenspace of the graph Laplacian can identify the 
connected components of the graph; hence, it is important to use a suitable graph Laplacian 
that is close to the ‘true’ graph Laplacian of the underlying graph. Asymptotic convergence 
of the graph Laplacian has been well studied in the literature for random graphs, that is, 
for graphs in which the data points are random samples from a suitable probability dis¬ 


tribution [see Rohe et ah (2011) for a comprehensive list of references]. However, we deal 
with graphical models in which the edges do not follow a particular probability distribution, 
but are dehned through suitable strength of association between different variables. In our 
case, such associations can be effectively expressed through partial correlations so that the 
conditional dependence structure of the corresponding graphical model is well preserved 
and interpretable. 

In this article, we propose a fully probabilistic approach to the problem. Specifically, 
we consider a p-dimensional Gaussian random variable X = (Xi,... ,Xp)^, where the p 
components of the random variable can be separated into K p (unique) clusters. The 
primary inferential aim is to retrieve the clusters along with their (connected) components 
from a random sample of size n from the underlying distribution, where the dimension p may 
grow with n. We de-convolve the problem into two sub-problems. First, we estimate the 
edge structure of the unknown graph using a Bayesian neighborhood selection approach, 
wherein we account for the uncertainty of graphical structure learning through model- 
averaged estimates of the suitable parameters. Second, conditional on the model-averaged 
estimate, we develop a nonparametric graph clustering on the Laplacian embedding of 
the variables using Dirichlet process mixture models. More importantly, we formalize the 
arguments for Laplacian embedding as suitable projections for graph clustering by providing 
theoretical support. Specihcally, we establish the consistency of the eigenspace of the 
graph Laplacian obtained from the estimated graph, which guarantees that the estimated 
graph Laplacian can be used as a suitable graph clustering tool. From a computational 
standpoint, we adopt fast computational methods for both graphical structure learning and 
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clustering of variables using nonparametric Bayesian methods. For clustering in particular, 
we resort to fast and scalable algorithms based on asymptotic representations of Dirichlet 
process mixture models as an alternative to computationally expensive posterior sampling 
algorithms, especially for high-dimensional settings. As an extension of this work, we show 
that our formulation can be used for multiple data sources, to simultaneously cluster the 
variables locally for each data set and also arrive at a global clustering for all the data sets 
combined. 

We evaluate the operating characteristics of our methods using both simulated and real 
data sets. In simulations, we compare our Bayesian nonparametric graph-based cluster¬ 
ing method with both standard (“off-the-shelf’) graph-based clustering methods as well 
as methods that do not incorporate the structural information obtained from a graphical 
model formulation so as to evaluate the effectiveness of using a graph-based clustering 
model. Using various metrics of clustering efficiency (such as normalized mutual informa¬ 
tion scores and between-cluster edge densities), we demonstrate the superior performance 
of our methods compared to the performance of the competing methods. Our methods 
are motivated by and applied to a novel pan-cancer proteomic data set, through which 
we evaluate protein networks and clusters across 11 different cancer types. Our analyses 
reveals several biologically-driven clusters that are are conserved across multiple cancers as 
well differential clusters that are cancer-specific. 

The paper is organized as follows. In Section we lay out the inferential problem and 
present preliminaries on graphical models and related concepts, including graph Lapla- 
cian and Laplacian embedding. In Section we propose methods for constructing the 
weighted adjacency matrices, and in Section]^ we present our nonparametric graph clus¬ 
tering method. We discuss the theoretical results pertaining to the consistency of the graph 
Laplacian in Section We present the results of our simulation study in Section and 
those of the real data analyses on pan-cancer proteomic networks in Section The Supple¬ 
mentary Materials contain extensions of our approach to multiple data sets and the results 
of additional simulations and real data analyses. We provide proofs of the theoretical 
results in an Appendix. 
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2 Basic inferential problem and preliminaries 


Let = (Xi,... ,Xn) be independent and identically distributed random variables of 
dimension p with mean 0 and covariance matrix S. We write Xi = (Xi j,... and 

assume that Xi, i = 1,... ,n are multivariate Gaussian variables. 

We also assume that there are K (true) underlying clusters for the p variables dehned as 
C = {Cl,..., Ck), which are tightly connected. Our basic inferential problem is to retrieve 
the clusters of variables from the data, P{C\X), which we do by breaking the problem 
into two sub-problems. (1) First, we learn the underlying graphical structure: P{Q\X) 
(see Section]^. (2) Then, we infer P{C\Q) using the appropriate probability models on 
the graph structures (see Section]^. We discuss each of these constructions in the ensuing 
sections. 

2.1 Graph Laplacians 

Consider an undirected graph Q = {V, E), where V represents the set of vertices {1,... ,p} 
and E is the edge set such that E C {{l,j) & V x V : I < j}. Two vertices vi and Vj are 
adjacent if there is an edge between them. The adjacency matrix is given hy A = {{aij)), 
where aij = 1 or 0 according to whether {l,j) E E or not. Alternatively, one may also 
consider a weighted graph, with weighted adjacency matrix W = {{wij)), where wij{> 0) 
denotes the “strength” of association (dehned suitably) and absence of an edge is rehected 
by a strictly zero edge weight. The degree of a vertex vi is given by di = Y7j=i the 

degree matrix can then be denoted hj D = diag((ii, ... ,dp). The clusters in the underlying 
variables are rehected by their conditional independence structure, so that variables in two 
diherent clusters are conditionally independent. The underlying graphical model forms a 
Markov random held such that the absence of an edge indicates that the corresponding 
random variables are conditionally independent. 

For an undirected weighted graph Q with weighted adjacency matrix W and degree 
matrix D, the graph Laplacian associated with Q is dehned as 

L = D-W. 

To be more precise, L is referred to as the unnormalized graph Laplacian. The normalized 
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versions of the graph Laplacians are given by 


L™ = I-D-^W. 


All of the graph Laplacians, L, L^y^, and Lrw, satisfy the property that they are positive 
semi-dehnite with p non-negative real valned eigenvalnes 0 = Ai < A 2 < ... < Ap. In fact, 
the mnltiplicity of the zero eigenvalne of the Laplacian is related to the nnmber of connected 
components of the nnderlying graph, which is formally stated in the following proposition. 


Proposition 2.1. Let Q be an nndirected graph with non-negative weights. Then the 
multiplicity d of the zero eigenvalue of the graph Laplacian L (or Lgym) equals the 
number of connected components in the graph. 


A formal proof of the above proposition along with the other properties of graph Lapla¬ 


cians can be found in Von Luxburg (2007). In light of the above proposition, graph Lapla¬ 
cians can provide a natural tool for clustering the underlying variables. We present two 
stylized examples in Figure one for a 10-dimensional graph with 3 connected components 
and another for an 18-dimensional graph with 6 connected components. The eigenvalues 
of the graph Laplacian equal the number of connected components in both cases. Also, the 
plot of the eigenvectors corresponding to the zero eigenvalues reveals that the eigenspaces 
of zero eigenvalues are spanned by indicator vectors that correspond to the connected 
components. 

In our setting, we utilize this information to construct a weighted adjacency matrix 
of dimension p, using suitable edge weights (which we discuss later), and then obtain the 
corresponding graph Laplacian. For the sake of brevity, we refer to the different forms of 
the graph Laplacian as L, irrespective of whether they are normalized or not. 


2.2 Laplacian embeddings 

Since L is a symmetric matrix, it has an orthonormal basis of eigenvectors. We let 
{i/i,... ji'Kn} denote the eigenvectors that correspond to the smallest Kn eigenvalues of 
L. We dehne the normalized Laplacian embedding as the map : {Xi,... ,Xp} —)• 


7 






Dimensions: 10x10 


(a) Adjacency matrix 



Dimensions: 18x18 


(d) Adjacency matrix 


Eigenvalues of normalized Laplacian 
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(b) Plot of eigenvalues 

Eigenvalues of normalized Laplacian 
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(e) Plot of eigenvalues 


Plot of eigenvectors for zero eigenvalues 



(c) Plot of eigenvectors 


Plot of eigenvectors for zero eigenvalues 





(f) Plot of eigenvectors 


Figure 1: Behavior of graph Laplacians in detecting connected components of a graph. 
The adjacency matrices are shown in plots (a) a 10-dimensional graph with 3 connected 
components and (d) an 18-dimensional graph with 6 connected components. Plots (b) and 
(e) show the eigenvalues of the corresponding normalized Laplacian of the graphs (the red 
line demarcates the zero and non-zero eigenvalues). Plots (c) and (f) show the eigenvectors 
(ev) corresponding to the zero eigenvalues of the graph Laplacians. 
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given by 

... ,i>i^Kr,)- ( 2 . 1 ) 

Typical spectral clustering methods generally proceed through the following steps. 
First, the normalized Laplacian is computed and each vertex Xi, I = 1,... ,p, is mapped 
to a iF„-dimensional vector through the above Laplacian embedding. Then, a standard 
clustering method is applied to the embedded data points. The Laplacian embedding helps 
to reveal the cluster structure in the data. A formal proof of this phenomenon has been 


studied recently; see Schiebinger et ah (2015). Thus, if the eigenvalues and eigenvectors of 
the true graph Laplacian, in other words, the population version of the graph Laplacian, 
denoted by £, are well approximated by that of the graph Laplacian L obtained from the 
data set, then Proposition 1.1 ensures that the connected components are revealed by the 
eigenvalues of L, and the corresponding Laplacian embedding helps to hnd the connected 
components, hence, the clusters. Accurate approximations of the eigenvectors are ensured 
under certain conditions on the eigenspace of the underlying matrices, provided L is close 
to C in certain matrix norms, such as the operator norm or Frobenius norm (see Section]^. 


3 Construction of adjacency matrices 

One of the primary steps for graph clustering is the construction of a suitable graph Lapla¬ 
cian for accurate identihcation of the underlying cluster allocations. Clustering based on 
Laplacians is closely related to a graph cutting algorithm, which forms partitions on the ver¬ 
tex set in such a way that within-partition edge weights are higher than between-partition 
edge weights. This requires efficiently and accurately dehning the edge weights so as to 
incorporate this property in the ensuing graph. We focus on Gaussian graphical models 
in this paper, since they provide a coherent probability model on the space of graphs and 
are computationally tractable. Specihcally, we assume a p-dimensional Gaussian random 
variable X = (Xi, ... ,Xp) having mean 0 and covariance S, where the precision matrix is 
given by G = The corresponding Gaussian graphical model can then be denoted by 
Q = (y, E), with V = {Xi,..., Xp} as the vertex set with an edge present between X; and 
Xj if and only of X; and Xj are conditionally independent given the rest. As before, the 
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absence of an edge is reflected in the entries of hi = ([cjij)), such that ojij = 0 if and only 
if {l,j) ^ E. Thus, the presence or absence of an edge between two variables is reflected 
in the inverse covariance matrix, and hence in the partial correlation matrix. We can take 
the absolute value of the partial correlation between Xi and Xj as the edge-weight between 

(V.V). 

One may also work with an unweighted adjacency matrix that consists only of binary 
(0 — 1) entries, which might be less efficient in some situations. For example, in cases where 
the partial correlation between two variables is near zero, assigning an edge between those 
two does not adequately account for the strength of association between those variables. To 
overcome this, we use the absolute value of the estimated partial correlations as the edge 
weights to construct the weighted adjacency matrix W and then derive the graph Laplacian. 
Estimation of the partial correlations can be performed using empirical methods for low¬ 
dimensional problems. For high-dimensional problems such as typical datasets arising from 
genomic and imaging applications, where p > n, empirical estimates are often unstable, if 
not estimable. To overcome this instability, sparse methods have been proposed for learning 
the covariance or inverse covariance matrix, as stated in Section [Tj In our framework, we 
want to incorporate the model uncertainty of the graph estimation in the clustering, for 
which we resort to Bayesian techniques for estimating the precision matrix. Classical 
variable selection methods typically do not incorporate model uncertainty, as opposed 
to Bayesian methods, which are capable of providing such measures by using posterior 
probabilities of models, as detailed below. 

Bayesian neighborhood selection: In this paper, we consider a neighborhood selection ap¬ 
proach that uses Bayesian model averaging. Consider p regression models Xi = + 

ei, I = 1,... ,p, where ei is a zero-mean Gaussian error with variance af corresponding to 
the regression of Xi on the rest of the variables. The coefficient estimates and 
obtained from regressing Xi on the rest of the variables and Xj on the rest of the variables 
can be used to obtain an estimate of the partial correlation between Xi and Xj. Note that 
Ppj) = —ujij/uii and = —coji/ujj. The partial correlation between Xi and Xj is given 
by rb = -uij/iuJiiUjjY^^, so that may be written as = siga{/3pj))y//3pj)/3j(^iy 

We describe the Bayesian model formulation for the above p regression models. We 
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have a (p — 1)-dimensional regression model in each of the cases with response Xi and 
regressors X_i := {Xi,... ,Xp}\{Xi}, / = 1,... ,p. For each of the p models, we dehne the 
(p — l)-dimensional indicator vector 7 / = such that = 1 if Xj is included in 

the model and ■ypj) = 0 otherwise, I = 1,... ,p. For / = 1,... ,p, the regression coefficient 
vector for the regressing Xi on the rest is denoted as jSi = 

For I = 1,... ,p, we specify the prior distribution on the model parameters for the 
regression model in a hierarchical manner as, 

=p(A I 


The choice of a prior distribution on the regression coefficients plays an important role in 
model assessment. In many applications, independent priors are put on the coefficients as 
a mixture of two components conditioned on 7 ;, given by 


p{A(i)} = {1 - + 7(i)P^^HA(i)}- 


The ‘spike and slab’ prior (Mitchell and Beauchamp, 1988) chooses the mixture components 


as p(°)(A(j)) = ll{o}(A(i)) p^^^(A(i)) = ll[-a>»](A(i))/2a for some a > 0. George and 


McCulloch (1993) used a different formulation of spike and slab models, assuming that 


(3 has a multivariate Gaussian scale mixture distribution defined by choosing p^^\(3i(j)) = 
N( 0 , 7 ^(^)) and P^^'^{Pi{j)) = The constant 7 ^) is chosen to be very small 

so that if 7/(7 = 0 , then has very negligible variance and may be dropped from the 
selected model. The constant is chosen to be large so that if 7/(7 = 1, then /?/(7 is 
included in the final model. The prior for 7 /(j) is modeled as 


7/(7 I M ~ (1 - m/(7)(5o(-) + m /(7 ~ U(0, 1), j ^ L 


In practice, the selection of the hyperparameters and upp may be difficult. Ish- 

waran and Rao (2000) introduced continuous bimodal priors leading to the hierarchical 
prior specification as Ao) I IjXpj) N( 0 , 7 /( 77/(7 | upp ~ (1 - m/(7)5o(-) + 

upj)6i{-), I ai, 02 ~ Gamma(ai, 02 ), M/(j) ~ U(0,1). 

The hyperparameters oi and 02 are chosen so that has a continuous bimodal 

distribution with spike at 0 and a right continuous tail. However, the effect of the prior 
vanishes with increasing sample size n, for which a rescaled version of the above spike and 
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slab formulation (Isliwaran and Rao, 2005) was proposed by rescaling the responses by a 


-y/n-factor , so that X^- = y/nXi^i, and using a variance inflation factor in the variance of 
the responses to account for the rescaling. The rescaled spike and slab model is especially 
useful in high dimensions, and is specified in a hierarchical manner (for the Ith regression) 
as, 


Xl^\X_i,,i3i,a^ 

indep. 



indep. 

o' 

T(i) 

iid 

rsj 

{1 - upj))6o{-) +ui(j)6i 

-2 1 
’h(j) 1 

iid 

rsj 

Gamma(ai, 02 ), 


iid 

r\j 

U(0,1), 


r\j 

Gamma(5i, 62 )- 


(3.1) 


Isliwaran and Rao (2011) proved strong consistency of the posterior mean of the re¬ 


gression coefficients under the rescaled model. We adopt the fast Gibbs sampling schemes 


proposed by Ishwaran and Rao (2005) which is detailed in the Supplementary Material 


(Section S2.1). 

We put a rescaled spike and slab prior on the regression coefficients /3ij for the p re¬ 
gressions in our context, and obtain parameter estimates using Bayesian model averaging. 
The estimated parameters are then used to obtain the estimated partial correlation ma¬ 
trix R = ((rb)), using the relationship between the partial correlations and the regression 
parameters mentioned above. The partial correlation matrix is then used to construct the 
weighted adjacency matrix W that corresponds to the graphical model. We use Bayesian 
model averaged estimates of the regression coefficients so that the model uncertainties are 
propagated while estimating the partial correlation matrix, which are subsequently used to 


calculate the Laplacian embedding using equation (2.1). 
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Nonparametric graph clustering using Laplacian em¬ 
beddings 


4.1 Dirichlet process mixture models 

Spectral clustering methods apply standard clustering techniques to the embedded data 
points with a priori hxed number of clusters. A practical way to choose the number of 
(unknown) clusters k is to identify the k smallest eigenvalues of the graph Laplacian L and 
then perform standard algorithmic clustering methods such as the k-means method. The 
major drawback in this regard is the requirement to prespecify the number of clusters, and 
also that the k-means algorithm tends to select clusters with nearly equal cluster sizes (as we 
demonstrate in simulations). To avoid such a situation, we propose to use Dirichlet process 
mixture models (DPMMs), which offer several computational and inferential advantages. 
We model the embedded data points obtained from the estimated graph Laplacian (as 
mentioned in the previous section) using a DPMM. 

Specihcally, we denote the embedded data points for variable Xi by yi = I = 

1,... ,p, where the dimension of the embedding (denoted by earlier) is chosen to be the 
number of inhnitesimally small eigenvalues of the estimated graph Laplacian. Note that, 
in a hnite Gaussian mixture model, the data are assumed to arise from the distribution 

c 

P{y) = ^^{y I Pc^c), 

C=1 

where tTc are the mixing parameters for C fixed components, and /ic, Sc are the mean and 
covariance of the corresponding Gaussian mixture components. We further assume that 
the covariances Sc are fixed to be al for all c = 1,..., G, n > 0. A standard Bayesian 
approach for inference in the above set-up can be described by the following hierarchical 
representation. 


TTi, . . . , TTc I tto ~ 

2^1, ... , Zp I , . . . , ~ 

yc \ p ~ 
yi\zu {hc}f=i ~ 


Dir (tto/G,..., tto/G), 
Discrete( tti, ..., ttc), 

N(/i^;,a/), 
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for some suitable prior distribution Gq. Since the covariances Sc are fixed, a prior dis¬ 
tribution over the means may be chosen to be N(0,p/), so that conditional distributions 
of the parameters may be obtained in a closed form for Gibbs sampling. In the above 
model, G {1,..., G}, I = 1,... ,p serves as the indicator variable for data point yi for a 
specihc cluster. One of the C clusters is hrst chosen according to the multinomial distribu¬ 
tion parameterized by tti, ..., ttc, followed by sampling from the corresponding Gaussian 
distribution parameterized by The mixture weights follow a symmetric Dirichlet dis¬ 
tribution with hyperparameter oq. A DPMM is obtained from the same generating process 
by letting G —)■ oo, and replacing the Dirichlet prior for tti, ..., ttc by using a stick-breaking 


construction (Sethuraman, 1994), 


f^j ~ Beta(l,Q;o)5 

i=i 

The corresponding sequence satishes ^ with probability one. 

DPMMs have the intrinsic property of clustering the data so that hard clusterings may 
be obtained by simulating from the posterior. A concise schematic representation of our 
Bayesian nonparametric graph clustering model in shown in Figure 


4.2 Posterior computations 
4.2.1 MCMC 


Gonditional posterior distributions of the cluster indicators are utilized for Gibbs sampling, 
looping through each data point yi, I = 1,... ,p. Data point yi is assigned to an existing 
cluster c with probability nc-i^{yi \ where Uc-i is the number of data points in 

cluster c, except yi. A new cluster is started with probability proportional to oq /N(y; | 
y,aI)dGo{y)- The conditional posterior distributions of the means are also obtained and 
used for Gibbs updating given all the data points in a particular cluster, after resampling 


all the clusters. Detailed procedures for using the Gibbs sampler are discussed in West 


et ah (1994) and Neal (2000). For completeness, we present the details of the MGMG 


procedure in the Supplementary Material Section S2.2. 
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Figure 2: Directed acyclic graphical representation of the Bayesian nonparametric graph 
clustering model. Circles represent stochastic model parameters, solid rectangles represent 
data and deterministic variables, dashed rectangles represent model constants, and solid 
and dashed arrows respectively represent stochastic and deterministic relationships. 
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4.2.2 Hard clustering via DP-means 


While MCMC approaches have benehts, one of the major drawbacks is their scalability for 
large p. Also, since our primary inference is on cluster indices for the p variables, per se, 
rather than parameter estimates, we propose to utilize a fast computing method that eval¬ 
uates the cluster indices corresponding to the above model formulation. [Kulis and Jordan 


(2012) considered an asymptotic version of the DPMM and developed an algorithm that 
gives the cluster assignments of the data points along with the cluster centers, without 
using the MCMC approach. Their method is closely related to the standard k-means algo¬ 
rithm, but with an additional penalty parameter on the number of clusters in the objective 
function. Their method, called DP-means, is based on the Gibbs sampler algorithm for the 
DPMM, and is derived using small variance asymptotics. 

Under the previously mentioned assumptions that the covariances Sc are hxed to be 
al, and the prior distribution Go over the means is taken to be N(0,p/), for some p > 0, 
the Gibbs probabilities can be computed explicitly. To implement DP-means, the base 
parameter uq is further assumed to be functionally dependent on a and p as uq = (1 + 
p/a)'^"/^exp(—A/2 ct), for some A > 0. The probability of assigning the data point I to 
cluster c is then proportional to 


Uc-i exp 




(4.1) 


where Uc-i denotes the number of data points already in cluster c, as introduced previously. 
Also, the probability for a new cluster is proportional to 

Then, as a —)■ 0 for a hxed p, the probability of data point i being assigned to cluster c goes 
to 1 when pc is closest to yi. One major advantage of using this procedure is scalability, 
especially in higher dimensions, where MCMC mixing might be slow. The DP-means 
algorithm performs similarly to MCMC-based methods, but saves time. The DP-means 
algorithm minimizes the objective function, 

k 

5^5^ll?/-^ciP + A/c, (4.3) 

c=l yeic 
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with 9c = being the resulting k clusters. The penalty parameter A 

controls the number of clusters. 


5 Consistency of the graph Laplacian 


One of the main contributions of this paper is to establish theoretical guarantees of our 
modeling endeavor. A critical assumption in this regard is that the eigenspace of the 
estimated graph Laplacian should well approximate the true eigenspace corresponding to 
the true graph Laplacian obtained from the true partial correlation matrix. We establish 
this using two steps. In the hrst step, we show that the operator norm of the difference 
between these two Laplacians is bounded and it is subsequently argued that the eigenvectors 
corresponding to the smallest eigenvalues chosen for embedding are close to their true 
counterparts. This ensures that the connected components in the graph are identihed 
accurately. 

We hrst dehne the notations to be used for the main theoretical results in this paper. 

By Tn = 0{5n) (respectively, o{6n)), we mean that Vn/Sn is bounded (respectively, 
r-n/Sn —)■ 0 as n —)■ cx)). For a random sequence = Op(Sn) (respectively, = 

op{Sn)) means that Pr(|X„| < M(5„) —)■ 1 for some constant M (respectively, Pr(|X„| < 
t 1 for all e > 0). 

For a vector x G we dehne the following norms: 


X 



1 < r < CX), 



max \Xj 


If A is a symmetric p x p matrix, let eig;^(A) < ■ ■ ■ < eigp(A) stand for its ordered 
eigenvalues. Viewing A as a vector in and an operator from (M^, || ■ ||r) to (M^, || ■ ||s), 
where 1 < r, s < cx, we have the following norms on p x p matrices: 


= (ELi > 1 < r < cx, ||A||oo = maXij \aij\, 

= sup{||Aa;||^ : ||a;||,. = 1}. 


The norms || ■ and || ■ || (r,r) are referred to as the L,.-norm and the Lr-opeiaXoi norm, respec¬ 
tively. Thus, we obtain the Frobenius norm as the L 2 -noinv given by ||A ||2 = A/tr(A^A). 
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Also, 


||A||(i,i) = maXjX^. \aij\, ||A||(oo,oo) = max* Xlj 

11^11(2,2) = {max(eigi(A^A) :l<i< 

and for symmetric matrices, ||A||( 2 , 2 ) = max{|eigj(A)| : 1 < i < p}, and ||A||(i^i) = 
II A|| (oo,oo)- 


5.1 Determining bounds for the estimated graph Laplacian 

We first present the results in which we show that the operator norm of the difference 
between the estimated graph Laplacian and the true graph Laplacian is bounded, which 
implies the closeness of the respective eigenvalues so that the dimensions of the embedded 
data points are identical for the estimated and true Laplacian embeddings. 

Theorem 5.1. Consider the normalized graph Laplaeian L = I — 
on the estimated weighted adjaceney matrix W and the true graph Laplacian C = I — 
T>-i/ 2 ')/y 21 -V 2 ^ yjliQfQ x> and W, respectively, are the true degree matrix and weighted adja¬ 
cency matrix. Then, under the assumptions that the minimum degree is bounded below by 
pV 2 (iogp)-V 2 -maximum degree is bounded above by p'^, 1/2 < k < 1, we have 

\\L - £11(2,2) < {\ogpf/Y-^/^\\W - W||2. 


Remark 5.2. In the present context of clustering, we use the absolute value of the estimated 
partial correlation matrix as the weighted adjacency matrix W. The corresponding true 
weighted adjacency matrix W is the true partial correlation matrix TZ. We have proposed to 
use a Bayesian neighborhood selection approach to estimate the partial correlation matrix. 
Consistency of the regression parameters would lead to the posterior consistency of the 
partial correlation matrix as well. Note that ||hL — W ||2 < ||R — ”^11 2 / hence, the posterior 
consistency of R leads to the posterior consistency of the estimated graph Laplacian L. In 
fact, \\L - £||( 2 , 2 ) = op(l) if {\ogpf/Y~^^^\\R - R-h = op{l)- 
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We present a proof of the above result in the Appendix. The bounds given above 
involve the difference between the estimated and true partial correlation matrices. Hence, 
the consistency of the graph Laplacian is dependent on the consistency of the partial 
correlation matrix. Apart from the Bayesian neighborhood selection approach in our paper, 
other consistent estimators of the partial correlation matrix or precision matrix would lead 
to the consistency of the graph Laplacian as well. This includes Bayesian estimation using 
Wishart priors, which are known to be consistent in the operator norm. 

Now we argue that the dimension of the embedded data points is identical to that 
obtained using the true Laplacian embedding. To show this, it suffices to argue that the 
eigenvalues of the estimated graph Laplacian and those of the true graph Laplacian are 
sufficiently close. By Weyl’s inequality, we have 

max |eigi(L) - eigi(£)| < \\L - £||(2,2) = op(l). 

Thus, the eigenvalues of L are close to those of C. As long as the eigen-gap (denoted 
by (5, say) between the first non-zero population eigenvalue and the corresponding zero 
eigenvalue(s) is large, the number of zero eigenvalues of L (or eigenvalues within a distance 
en from zero, where Cn —t 0 is the rate of contraction of L, dependent on the data) is equal 
to that of C. In fact, the condition that 5 > 2e„ suffices in this context for the dimensions 
of the embedded spaces to be identical. 

5.2 Closeness of the eigenvectors 

We now argue that the eigenvectors corresponding to these zero eigenvalues are also close 
to their population counterparts. We are dealing with orthonormal vectors (so that both 
are of equal magnitude), and hence evaluating the closeness of two such vectors reduces to 
determining the bounds of the principal angle between them. The theorem below presents 
the result on these bounds. The result involves the d zero eigenvalues of the true graph 
Laplacian C and the corresponding e„-small eigenvalues of the estimated graph Laplacian 
L. 

Theorem 5.3. Let V = {vi,...,Vd) G be the eigenveetors corresponding to the d 

zero eigenvalues of C and V = {vi,...,Vd) G be the eigenvectors corresponding to 
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the d zero (or Cn close) eigenvalues of L. Also, denote by 6 the eigen-gap between the zero 
eigenvalues and the first non-zero eigenvalue of C, and assume that 6 > 2e„. Then the 
principal angle between V and V can be bounded as 

The result easily follows that presented by Davis and Kahan ( 1970| , which has been 
recently improved by Yu et ah (2014), involving assumptions on population eigenvalues 
only. The improvement is signihcant in the sense that we only need to care about the 
eigen-gap in the true graph Laplacian £, not in the estimated one. We present their result 
in the theorem below, followed by the proof of our result above. 


Theorem 5.4 (Modihed version of Davis-Kahan theorem (Yu et al., 2014)). Let A,A^ 
l^pxp symmetric, with eigenvalues Xi < ... < Xp and Xi < ■ ■ ■ < Xp, respectively. Fix 1 < 
r < s < p and assume that 6 = min(Ar — A^-i, A^+i — A^) > 0, where Ap+i := oo, Aq := —oo. 
Let d ■.= s — r + 1, and let V = (vr, Vr+i,... ,Vs) E and V = {vr, Vr+i, ■ ■ ■ ,Vs) E 
have orthonormal columns satisfying Avj = XjVj and Avj = XjVj, j = r,r -\-1,..., s. Then, 

\\A - ^ 11 ( 2 , 2 ) 


Sin 0 ( 1 /,!/) II(2,2) < 


In our context, we take the hrst d zero eigenvalues of the graph Laplacian, so that 
r = l,s = d in the above theorem. The matrices A and A are taken to be C and L, 
respectively, so that the sine of the principal angle between the eigenvectors is bounded 
above by the difference in the operator norm of the graph Laplacians. From Theorem 


5.1, it follows that the L 2 -operator norm of the sine of the principal angle between the 


eigenvectors goes to zero with high probability. 

Thus, in summary, we have proved that L well approximates C with respect to the 


eigenspace, and hence in light of Proposition 2.1, L can be used as an effective tool to 
cluster the variables with graphical dependencies. 


6 Simulations 

We perform simulation studies to evaluate the performance of our Bayesian nonparametric 
graph clustering (BNGC, henceforth) method in scenarios with varying dimensions and to 
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compare our method with other competing methods. Specihcally, we evaluate the utility of 
using the partial correlation matrix for dehning the weighted adjacency matrix for graphs 
and the effectiveness of using a Bayesian nonparametric model for cluster identihcation. 
Simulation design We consider p-dimensional random variables with p = 100, 200, 500, 
and sample size n = 100,200 for each p, with includes both n ^ p and n > p scenarios. The 
true number of clusters K satishes 1 < K < p. We consider different values of K such that 
^ = LfJ) each K, we randomly partition the p variables in K non-empty 

clusters, with all partitions having equal probability of occurrence. We simulate 10 such 
partitions for each K. For a given partition {Xi ,..., Xk} of X, we generate data from a 
Gaussian mixture model as 

K 

fi^) = Y[fji^j)- ( 6 - 1 ) 

j=i 

Here, fj is a multivariate normal distribution of dimension pj, with mean 0 and variance 
Sj. Sj follows a Wishart distribution with degrees of freedom pj + 1 and scale matrix 
identity, where pj denotes the size of cluster j, 1 < j < , and Pj = P- 

For each n,p and K, we simulate 100 data sets and apply our method to identify the 
clusters. For construction of the adjacency matrices, we use the ‘spikeslab’ package in 
R, applying the default parameter settings. For graph clustering using DPMM, we resort 
to the DP-means algorithm for faster computation, choosing the penalty parameter A for 
the number of clusters to be 1/2. We compute normalized mutual information (NMI) 
scores that correspond to each combination of n,p,K to assess the performance of our 
method. NMI scores are useful for evaluating clustering performances when the true cluster 
labels are known. We denote the true cluster labels as C and the cluster labels obtained 
from a suitable clustering method as C. Note that both C and C are partition sets of 
the same set of p variables. As a measure of uncertainty, the entropy of a partition set 
T = {Ti,..., Ts}, s = |T| is dehned as 


|T| 


H(T) = 5^MiogM. 

p p 


i=l 


Also, the mutual information between C and C is given by 


|c| |cj 

MI(C,C) = EE 


i=i j=i 


\anc,\ Plane, 
p \a\\c,\ 
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where Qs and Cjs are elements of the sets C and C, respectively. The normalized informa¬ 
tion score between C and C is then dehned as 

NMI scores are bounded between 0 and 1, with zero implying complete independence 
of cluster labelings, and values close to one indicating signihcant agreement between the 
clusters. 

In addition to NMI scores, we compute the between-cluster edge densities to assess the 
overall clustering performance of our methods in hnding tightly connected graph compo¬ 
nents. The between-cluster edge density for a partition set C is given by 

edge density(C) = Wij, 

ki^k2 

where Wij is the edge weight between Xi and Xj for the corresponding graphical model. 
Between-cluster edge densities are bounded below by zero, with zero implying that the 
partitioning of the graph is such that there are no edges shared by the vertices in different 
clusters. In our simulation design, the true graphs are such that between-cluster edge 
densities are perfectly zero; hence, the above measure is a good tool to use in evaluating 
the clustering performance of a method. Deviation of the above measure from zero indicates 
non-agreement with the true cluster labels. 

Comparison with competing methods. To benchmark our results, we use three com¬ 
peting methods: k-means, graphical kernel-based (graph-kernel) and hierarchical clustering 
with average-linkage (ALC) methods. The hrst two are standard spectral clustering algo¬ 
rithms. The k-means method uses a weighted adjacency matrix that is identical to ours. 
We also compare our method with spectral algorithms applied to the adjacency matrices 
obtained from the data using kernel-based similarity measures. We use the kernlab pack¬ 
age in R for htting the latter method. Note that both spectral clustering methods require 
the specihcation of the number of clusters beforehand. The comparison to the k-means 
method shall assess the performance of using a nonparametric Bayesian model to arrive 
at the hnal cluster labels, as they both utilize the identical adjacency matrix. The com¬ 
parison to the graph-kernel method shall assess the performance of the adjacency matrix 


NMI(C,C) 


MI(C,C) 

/h(C)H(C) 
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constructed with the estimated partial correlation matrix from the Bayesian neighborhood 
selection approach, with regular adjacency matrices used in most of the applications. For 
the ALC, we use the empirical correlation matrix and provide the true number of clusters 
while performing hierarchical clustering. Note that for all the competing methods, we need 
to supply the number of clusters, and we provide the true number of clusters while assess¬ 
ing the performance, thus giving the methods a fair advantage. In comparison, our BNGC 
method learns the number of clusters from the data set, itself. 

Results. The complete simulation results using the different metrics and scenarios are 
presented in Table 1. We observe that for all the methods, the NMI scores increase with 
an increase in sample size for hxed p and K. However, the NMI scores corresponding to 
BNGC and k-means methods indicate that these two methods have signihcantly better 
performance than the graph-kernel and ALC methods across all n,p,K scenarios. In fact, 
even for n = 200, we observe NMI scores above 0.95 for all p and K when using the BNGC 
and k-means methods, but the graph-kernel and ALC methods fail to achieve such a mark. 
Hence, the results favor the partial correlation matrix as the adjacency matrix for graph 
clustering. The performances of the methods with respect to the NMI scores are shown in 
Figure 

The superior performances of the BNGC and k-means methods are also reflected in 
the edge densities between the clusters. For both methods that use the partial correlation 
matrix as the adjacency matrix, the between-cluster edge densities decrease with increasing 
sample sizes, but no such trend is seen for the remaining two methods. Also, although the 
edge densities for the BNGC and k-means methods shrink to zero, implying a signihcant 
absence of edges between clusters, the corresponding edge densities for the graph-kernel 
and ALC methods are much higher than zero, implying a poor clustering performance. 
This result further strengthens the use of a partial correlation matrix as an effective tool 
for graph clustering in such cases. We conjecture that the dependence structure in a graph 
is more efficiently captured by the partial correlation matrix in comparison with other 
adjacency matrices, such as the kernel-based adjacency matrix. 

Although BNGC and k-means methods perform similarly, the performance of the BNGC 
method is slightly superior. Also, the measure of the between-cluster edge densities goes 
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Figure 3: Normalized mutual information (NMI) scores for different values of n and p that 
correspond to the various clustering methods. 


to zero faster under our method than under the k-means method. This implies that clus¬ 
ter misspecihcation is lower under the BNGC method. In addition, the BNGC method 
performs better than the other methods when choosing the number of clusters. 
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•f implies that the method failed to produce any output due to memory problems 




























7 Proteomic signaling networks in cancer 


7.1 Scientific problem and data description 


Proteins are the ultimate effector molecule of cellular functions. Proteomics, in general, can 
be defined as a large-scale high-throughput study of proteins from a variety of biological 
samples in order to investigate their ontology, classification, expression levels, and prop¬ 
erties. A primary functional proteomic technology is reverse-phase protein array (RPPA), 
which allows for quantitative, high-throughput, time- and cost-efficient analysis of pro¬ 


teins using small amounts of biological material (Paweletz et al., 2001 Tibes et ah, 2006). 
RPPA allows for the simultaneous assessment of multiple protein markers in multiple tumor 
samples in a streamlined and reproducible manner (Sheehan et al., 2005; Spurrier et al.| 


2008). This technology has been extensively validated for both cell line and patient samples 


(Tibes et al., 2006| Hennessy et al., 2010| Nishizuka et al., 2003), and its applications range 
from building reproducible prognostic models (Yang et ah, 2013) to generating experimen¬ 


tally verified mechanistic insights (Liang et al., 2012) and biomarker discovery, especially 


in cancer (Hennessy et ah, 2010). For a detailed introduction, data pre-processing and 


normalization of RPPA data, see Baladandayuthapani et al. (2014). 

It is well established that oncogenic proteomic changes occur in a coordinated manner 
across multiple signaling networks and pathways, reflecting various pathobiological pro¬ 


cesses (Zhang et al. 


Halaban et ah, 2010). Numerous inhibitors of such pathways 


have been used in clinical trials, frequently demonstrating dramatic clinical activity. In¬ 
hibitors that target protein signaling pathways have been approved by the U.S. Food and 
Drug Administration for a variety of cancer types, including leukemia, breast cancer, colon 


cancer, renal cell carcinoma, and gastrointestinal cancer (Davies et ah, 2006). Thus, de¬ 
veloping an accurate understanding of the composition (i.e., clustering) and topology (i.e., 
graph) of these protein signaling networks across multiple cancer types can provide deeper 
biological insights about proteomic activity. 

The proteomic data set we consider here was generated by RPPA analysis of 3467 pa¬ 
tient tumor samples across 11 cancer types, and was obtained from The Cancer Genome 
Atlas (TCGA, http://cancergenome.nih.gov). The tumor samples include 747 breast can- 
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cer (BRCA), 334 colon adenocarcinoma (GOAD), 130 renal adenocarcinoma (READ), 
454 renal clear cell carcinoma (KIRC), 412 high-grade serous ovarian cyst adenocarcinoma 
(OVCA), 404 uterine corpus endometrial carcinoma (UCEC), 237 lung adenocarcinoma 
(LUAD), 212 head and neck squamous cell carcinoma (HNSC), 195 lung squamous cell 
carcinoma (LUSC), 127 bladder urothelial carcinoma (BLCA), and 215 glioblastoma mul¬ 
tiforme (GBM) samples. From each tumor sample, we have measurements of 181 different 
proteins that cover major functional and signaling pathways in cancer such as P13K, MAPK 
and mTOR. Based on the availability of protein data across a large number of tumor sam¬ 
ples, the scientihc objectives on which we focus in this paper are: (a) evaluate how the 
protein network topology change for each cancer type; and (b) use this information to 
evaluate how proteomic clusterings change within and across cancer types. These investi¬ 
gations will provide insights into clusters that are conserved across multiple tumors as well 
differential networks/clusters that are tumor-specihc. 


7.2 Results 


7.2.1 Cancer-specific clustering 

For each cancer type, we apply our nonparametric graphical clustering model, as explained 
in the previous sections, to obtain cancer-specihc networks and corresponding clusters. The 
number of clusters identihed by our method for each cancer type is shown in Figure]^ The 
results show considerable cluster heterogeneity between cancer-types, with bladder cancer 
(BLCA) having the largest number of clusters and squamous cell lung cancer (LUSC) the 


lowest. Those results are consistent with previously published Endings (Akbani et al 


2014), where bladder cancer was shown to be the most heterogeneous disease among the 


11 difierent tumor types studied, in terms of proteomic activity. 

The corresponding networks of proteins for BRCA, LUSC, READ and GBM are pre¬ 
sented in Figures [5a] [5bl [6a| and [6b| respectively; the rest are provided in the Supplementary 
Materials (Section S4). For better visualization, we highlight clusters that include at least 
4 proteins. As is evident from the figures, the within-cluster edges are considerably higher 
compared to the between-cluster edges, which establishes that our method performs rea¬ 
sonably well in picking relevant protein clusters. 
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Number of clusters for different cancertypes 



LUSC READ BRCA GOAD OVCA UCEC KIRC GBM LUAD HNSC BLCA 


Figure 4: Number of clusters identified by the BNGC method for different cancer types. 

7.2.2 Validation using known signaling pathways 

To further investigate the proteomic clusters identihed by our method, we used known 
biological knowledge - proteins that are mapped to existing signaling pathways. We used 
information from 12 signaling pathways: apoptosis, breast reactive, cell cycle, core re¬ 
active, DNA damage response, EMT, hormone receptor, hormone signaling, PI3K/AKT, 
RAS/MAPK, RTK, and TSC/mTOR. We then defined a pathway enrichment probability 
matrix that corresponds to cancer type f as a metric to evaluate which pathways are 
grouped together in different cancers. This matrix is of dimension Kt x 12, where Kt is 
the number of major clusters in cancer type t. For each cancer type, we follow a Bayesian 
hypothesis testing procedure to check whether the proportion of proteins from pathway j 
in cluster k is signihcantly higher than the proportion of proteins from pathway j outside 
cluster k. This is a simple test to determine whether the binomial proportion is greater 
than 1/2, and can be carried out using a beta-binomial model from a Bayesian perspective. 
Denoting the number of proteins from pathway j in cluster k as yuj and the total number 



of proteins in pathway j as Nj, the test model is given by 


Hkj I 6 ~ Bm{Nj,9), 9 ~ Beta(l, 1). 

The posterior distribntion of 9 \ Nj,ykj is Beta.{ykj + '^,Nj — y^j + 1). Thns, the 
element of for a particnlar cancer type t is given by 

> 0-5 I Nj, ykj), 

the posterior probability that 9 exceeds 1/2, which can be easily compnted nsing Monte 
Carlo methods. 

The corresponding pathway enrichment probabilities for the different clnster-cancer 
combinations are shown as a heatmap in Fignre Major pathways that are enriched 
{Ve > 0.5) are also shown in Table We hnd major enrichment in three pathways: 
hormone receptor (6 cancers), core reactive (6 cancers), and RTK (3 cancers). 


Cancer type 

Enriched pathways 

LUSC 

Cell cycle. Core Reactive, Hormone receptor, RAS/MAPK, RTK 

LUAD 

Core reactive, DNA damage response, RTK 

READ 

Core reactive. Hormone receptor, TSC/mTOR 

COAD 

Breast reactive. Core Reactive, Hormone receptor 

BRCA 

Breast reactive. Hormone receptor, RTK 

OVCA 

Core Reactive, RAS/MAPK 

UCEC 

Core Reactive, Hormone receptor 

KIRC 

Cell cycle, TSC/mTOR 

CBM 

Hormone receptor 

BLCA 

Hormone Signalling 


Table 2: Table showing pathways for different cancer types with pathway enrichment prob¬ 
ability exceeding 0.5. 
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7.3 Biological interpretation of results 

A comparison of Figure]^ with Table shows a roughly inverse relationship between the 
number of clusters versus the number of pathways identihed in a given tumor type (with 
a few exceptions). For example, BLCA has the largest number of clusters but only one 
pathway, whereas LUSC has the fewest number of clusters, but the largest number of 
pathways (hve). Part of the explanation is that we are looking for pathways that are 
enriched consistently across the full cohort of samples within a tumor type. If a tumor type 


is very heterogeneous (such as BLCA, Network et ah (2014)), that decreases the chances 
that we will hnd enrichment of a pathway consistently across all the samples within that 
tumor type, which in turn translates to fewer enriched pathways found. However, tumor 
type heterogeneity will result in a greater number of distinct clusters found. GBM, for 
instance, was previously shown to have 4 to 5 distinct subtypes (Verhaak et ah, 2010| 
Brennan et al.| 2013), but only one enriched pathway was found by us. 


Several of the pathways found in Table have known biological underpinnings. For 
example, it is well-known that a large subset of women’s cancers, such as breast (non-triple 


negatives) and endometrial, have elevated hormone receptors (ER and PR) (Perou et ah 


2000 Sorlie et ah, 2001 Creasman, 1993), which was picked up by our method. In addition. 


a substantial number of breast cancers have elevated HER2 levels and up-regulation of the 


RTK pathway as well (Perou et ah, 2000 Sorlie et ah, 2001). The breast reactive pathway 


was dehned using previously described breast reactive samples (Network et ah, 2012) and 
successfully identified by our algorithm as a validation check. The reactive markers were 


also found to be high in colorectal cancers (COAD/READ) in a previous study (Akbani 


et ah, 2014), and we have found the same pathway to be activated in COAD/READ. 


Hormone signaling activity related to the expression of GATA3 is high in both normal and 


malignant bladder samples and it is captured by our results (Miyamoto et ah, 2012). The 
tumor suppressor gene BTG3 has been shown to be down-regulated in renal cancers through 


hyper-methylation of its promoter (Majid et ah, 2009). BTG3 is a cell cycle inhibitor, so 
its down-regulation increases cell cycle activity in renal cancers, which is also illustrated 
by our results. Other novel results, such as the role of hormone receptors in GBM, remain 
to be investigated in detail. 
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(b) 


Figure 5: Protein networks for (a) breast cancer (BRCA) and (b) lung squamous cell 
carcinoma (LUSC). Clusters with at least 4 proteins are color-coded; those with fewer are 
gray. 
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(a) 



(b) 

Figure 6: Protein networks for (a) renal adenocarcinoma (READ) and (b) glioblastoma 
multiform (GBM). Clusters with at least 4 proteins are color-coded; those with fewer are 
gray. 
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8 Summary and discussion 


We develop nonparametric methods for identifying clusters of variables for moderate to 
high-dimensional data with graphical dependencies. Our fully probabilistic methods al¬ 
low joint graphical structure learning and cluster identihcation utilizing the eigenspace of 
the underlying graph Laplacians. We provide rigorous theoretical justihcations for choos¬ 
ing Laplacian embeddings as suitable projections for graph-structured data. In addition, 
we present fast and scalable computing techniques that are capable of handling high¬ 
dimensional data sets. Through simulations, we demonstrate superior numerical perfor¬ 
mance of our methods over standard competing methods from the literature. Our methods 
are motivated by and applied to a novel pan-cancer proteomic dataset where we infer tightly 
connected clusters of proteins across various cancer types validate them using known bio¬ 
logical pathway-based information. 

From the data analysis perspective, the proteomic data we have, consists of the same 
set of proteins over different subjects and cancer types. Although we analyzed each data set 
separately, a combined analysis across different cancer types can be performed by pooling 
the “matched” data across proteins. In the Supplementary Material Section SI, we discuss 
a possible approach to this problem using our Bayesian graphical clustering method. The 
approach takes into account the graph Laplacians for each of the data sets and uses a joint 
model for the local clusters within the individual data sets and a single global cluster across 
all cancer types. An open problem in this regard is that we do not know the behavior of 
the Laplacians in determining the overall global clustering; we leave that for a future study. 

Another possible extension of the clustering problem discussed in this article is to work 
with discrete or mixed graphical models, where the variables are no longer required to 
be continuous, but may vary over different domains, including categorical variables. Such 
data sets are becoming more abundant in the literature; thus, smarter scalable methods 
need to be developed. In these contexts, the relations among the different variables become 
nonlinear; hence, we need to extend our work beyond Gaussian graphical models. Preci¬ 
sion matrices (or partial correlation matrices) do not provide an effective way of dehning 
the edges of the corresponding graph. We need to develop measures that can capture the 
nonlinear dependency among the vertices of the graph and translate the strength of asso- 
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ciation to the corresponding edges. We are currently exploring these aspects using fully 
nonparametric techniques for graphical models. 
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A Proofs and additional results 

We present the proof of the consistency of the graph Laplacian. The proof uses standard 
matrix ordering inequalities, which we present in the following lemma. 

Lemma A.l. For symmetric matrices A and B of order p, we have, 


1 . 


< I|A||(2,2) < ||A||2 <p||A||oo, 


2 - ||AR||(2,2) < ||A||(2,2)||i?||2, ||Ai?||(2,2) < || A || 2 || 51| (2,2) ■ 


Proof of Theorem 5_A. We show the consistency of the graph Laplacian in the operator 
norm. We work with the normalized Laplacian Lgynn and denote it by L for simplicity. 
Recall that L = I - is the graph Laplacian obtained from the data with 

weighted adjacency matrix W = {{wij]), and C = I — jg the true Laplacian 

corresponding to the true adjacency matrix W = ((wf^)). The L 2 -operator norm of the 
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difference between the matrices L and C is given by 


||L)-V2|^L)-V2 _ 2)-i/2yyp-i/2||^2^^^ 

= \\D-^/‘^{W - + L)-^/2yyp-l/2 _ p-l/2^ ^ p-l/2 _ p-l/2^yyp-l/2||^^^^^ 

< \\D-^/‘^{W - W)D-^/\2,2) + - 1^“'/')II(2,2) 

= T 1 + T 2 + T 3 . (A.l) 


Now, 


Note that 


Ti = \\D-^/‘^{W -W)D-^%2,2) < ||/^■i(2,2)||W^- W||(2,2). 


V 11 ( 2 , 2 ) = max|eig;(P ) = max 




= - 7 -V = —’ (^- 2 ) 

where Tp is the minimum degree of a vertex divided by the maximum possible degree. 
We have assumed that > (plogp)“^, so that pTp > p^/^(logp)“^/^. Hence, we obtain 
||^^■1(2,2) < (logp/p)^/2. Also, 


|I^-'||(2,2) < ||P-'||(2,2) + p-'-^^-'| 

1-1 


( 2 , 2 ) 


which gives us 


Then, 


< ||P-i(2,2) + \\D-%2,2)\\D - P||(2,2)p-i(2,2), 
p-i(2,2) 


-D 11(2,2) < 


l-|lI)-l||(2,2)||/^-iP||(2,2)’ 


(A.3) 


11^-2^11(2,2) < ||/^-P||2 = {tr(D-2^)'} 


2-1 1/2 


P P 


1/2 


P P 


1/2 


E E-'.-E 

1=1 \j=i j=i 

= p'/'||1T->V||2. 


w. 


1=1 j=l 


0 \2 


Thus, \\D ^ 11 ( 2 , 2 ) < (logp/p)^/^ with high probability if (logp)^/^||l/T — WII 2 = op(l)- 
Hence, we obtain, Ti < p~^/‘^{\ogpY^‘^\\W — W|| 2 . 
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Now, we determine bounds for the term T 2 . We have 


T2 = 

< ll^■'/"ll( 2 . 2 )ll^^■'/^ll( 2 , 2 )||>V||( 2 , 2 )||/^'/^ - ^^'/^||( 2 , 2 ) (A.4) 

Now, 11 ^ 1/2 _ pi/2||^2,2) < \\D^/^ - V^/^y, SO that we get, 

T 2 < ||/^-'/"||(2g)p-'/'ll(2,2)||W||(2g)p'A||l^ 

Also, ||P'^A||^ 2 , 2 ) = < P"^'^^(logp)^A^ and = ||A'”^||( 2 , 2 ) < p“^A(iogp)V 2 _ 

Hence, we obtain 

T 2 < p-^/\\0gpf/^W - W||2i|>V||(2,2). 

Similar arguments lead to the same bounds as above for T 3 . We have || W||( 2 , 2 ) < ||W||(oo,oo) = 
max^^^i 1 1 ( 7 ;° I < p'^, say. Then, 

\\L - £ 11 ( 2 , 2 ) < p'^-'A(logp)3/4||^ _ ^||^_ 

This completes the proof. □ 
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Supplementary 


SI Clustering for multiple datasets 

Suppose we have J data sources with different set of subjects but same set of variables 
corresponding to each of the subjects and data sources. Our aim is to cluster the variables 
(for example, proteins in cancer data) for multiple data sources, involving the same set of 
variables, taking into account the heterogeneity of the data sets and the interdependence 
between them. We approach this problem from a Bayesian graph clustering point of view 
as in the main paper, but now modifying our method so as to deal with multiple data sets. 
We discuss the clustering model and method in the following sections. 


Sl.l Clustering model and method 


Lock and Dunson (2013) considered the problem of clustering a fixed set of objects based 


on multiple datasets. Their method, called Bayesian Consensus Clustering (BCC), tries to 
simultaneously determine the source-specific clustering and the global clustering of objects 
incorporating the heterogeneity of the individual datasets. BCC uses Dirichlet mixture 
model for different data sources. The conditional distribution of the source-specihc clus¬ 
terings given the global clusterings are modeled through a suitably defined dependence 
function with an adherence parameter. The number of local and global clusters are taken 
to be equal. Effective estimation of the clustering model is carried out using a Bayesian 
framework involving MCMC methods. 

We borrow a similar idea like above, but instead of clustering the objects, we cluster the 
variables in question. We first construct the weighted adjacency matrix corresponding to 
each of the data sets using the methods described in the Section 3 of the main manuscript, 
and then obtain the corresponding graph Laplacians. Corresponding to the J data sources, 
we perform Laplacian embedding of the variables so as to get 3 ^ 1 ,..., as the modified 
data. The embedded data point for variable t in data set j is denoted by Yjt. We assume 
that there are separate local clusters for each of the data sets (denoted by Cj for data 
set j) and one global cluster involving all the data sets (denoted by C^). Thus, each Yjt 
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belongs to a local cluster G {1,..., K} and one global cluster Cf. Each Yjt, t = 1,... ,p 
is assumed to be independently drawn from a i^-component mixture distribution with 
parameters 0^1,, OjK- The local clusters Cj = {Cji ,..., are dependent on the global 
clusters = (Cf ,... ,Cp as 

P(C‘,^klCf) = i^(k,Cf,aj), (Sl.l) 


where z/- is the dependence function controlled by the parameter aj. The form of the 
dependence function is given by 

ra \ rqi 9^ 

^ (SI.2) 

I otherwise 

where aj G [ 1 /i^, 1 ] acts as an adherence parameter of the local clustering for data source 
j to the global one. 

Assuming a Dirichlet prior on the probabilities tt^ = P(Cf = k), we can get the proba¬ 
bility that variable t belongs to cluster k in data source j as 

P(Cji = A; I vTi,.. .,tik) = T^kOij + (1 - TTfc) ^ , (SI.3) 


and the conditional distribution of the global clusters is obtained as 

J 

P(Cf = fc I ..., cl, Til,..., tik, aj) cx TTfc JJ z/(Cji, k, 

i=i 


Oi 


(S1.4) 


The parameter estimates along with the local and global clusters can be obtained using 


MCMC as described in Lock and Dunson (2013). One major challenge in application of 
this method is in the choice of the number of clusters. One adhoc technique would be to 
choose the maximum of the number of clusters obtained from clustering the data sources 
separately using our method based on DP-means. 


SI.2 Application to pan-Cancer proteomic data 

We consider the proteomics data described in the previous section, and apply the modihed 
BCC method as described above to cluster the proteins integrating over multiple cancer 
types. We would like to explore how the proteins group together across multiple sources 
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especially when the inter-dependence among the sources are modeled statistically. We 
consider two different types of lung cancers, namely LUAD and LUSC, to find the groups 
of proteins which move together in these cancer types. For both the cancer types, we 
found that for separate clusterings using our method described in this paper, there are 
nine clusters which have at least four members. Hence we specified the number of clusters 
in the modihed BCC method to be nine. The resulting global cluster is shown in Figure 



Figure SI: Figure showing global clustering of proteins for LUAD and LUSC cancer types. 
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S2 Details of posterior computations using MCMC 


S2.1 MCMC for Bayesian neighborhood selection 

We present the details of the MCMC procedure for the rescaled spike and slab prior. The 


MCMC procedure is given in detail in Ishwaran and Rao (2005). The procedure, known 


as Stochastic Variable Selection (SVS) is given as below. We present the details for the 
regression model with response Xi and the rest of the variables as regressors. 

Posterior values (/3i, 'yi,Ti,ui, af \ X^) are drawn using the hierarchical Bayesian model 
(3.2) of the main manuscript, where Xj* = and ti = Also let us 

denote X_i = (Xi,..., X/_i, X^+i,..., Xp), A^q) = li{j)r^Q) and The Gibbs 

sampler algorithm runs as follows - 

1. Simulate f3i \ Xi, af, X/" from a multivariate Normal distribution with mean HiX^iXi 
and covariance crfT,i, where 

Sz = (XTzX_z + ncrf A; = diag(Az(i),..., Az(z+i),..., Az(p)). 


2 . Simulate 


Xiij) I 


indep. 




Ul,l{j) + U2,l{j) 


Ul,l(j) + U2,l{j) 


where = (1 - ui)vq exp(-^^|^), = Mexp(-|¥^) 


3. Simulate | f3i,'yi(j) Gamma + 1/2,02 + 

4. Simulate M; | A; ~ Beta(l + #{j : A^q) = 1 }, 1 + #{j : A^q) = z/q}). 

5. Simulate aj~‘^ \ f3i, X/ ~ Gamma(6i + n/2, &2 + ||Xf — X_if3\\^/2n). 

6 . Set Az(j) = j ^ h 

The above steps complete one iteration of the Gibbs sampler. 
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S2.2 MCMC for Clustering using Dirichlet Process Mixture Mod¬ 
els 

We now present the details of the MCMC procedure for the Dirichlet Process Mixture 
Models used in graph clustering as described in Section 4 of the main manuscript. The 
choice of conjugate priors leads to availability of exact analytical forms of the posterior 
distributions so that Gibbs sampling is easily accomplished. The steps in the MCMC 
procedure are as below. 

Given and from iteration f —1, and are sampled 

as - 


1. Set z = 

2. For / = 1,... ,p, 


(a) As we are going to sample a new zi for data point yi, remove yi from cluster zi. 

(b) If the yi removed above was the only data point in that cluster, the cluster 
becomes empty. We remove this cluster, and C is decreased by 1. 

(c) We re-arrange the clusters so that the cluster labels are 1,..., C. 

(d) A new sample for zi is drawn as 


Pr(z/ = c, c < C) oc 
Pr(zi = C -|- 1) oc 


T~' i N(y, I 

n -|- ao — 1 

tto 

n + aQ — 1 


(e) \i zi = C + we get a new cluster. The cluster index is C -|- 1. Sample a 
new cluster parameter pc+i from a multivariate Normal distribution with mean 
(p“^/ -|- a~^I)~^a~^yi and covariance (p“^/ -|- . 

3. For c = 1,... ,6*, sample cluster parameter for each cluster from a multivariate 
Normal distribution N((p“^/ + nc(J~^I)~^Ylizi=cyh{p~^^ + where ric is 

the number of data points in cluster c. 


4. Set zW = 2 . 
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S3 Supplementary figures for different cancer types 


CD-CD 




Figure S2: Clusters for Uterine corpus endometrial carcinoma (UCEC). Clusters with at 
least 4 proteins are color-coded; those with fewer are gray. 
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(b) 

Figure S3: Protein networks for (a) Bladder urothelial carcinoma (BLCA) and (b) Colon 
adenocarcinoma (GOAD) cancer types. Clusters with at least 4 proteins are color-coded; 
those with fewer are gray. 
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(b) 

Figure S4: Protein networks for (a) Head and Neck squamous cell carcinoma (HNSC) and 
(b) Renal clear cell carcinoma (KIRC) cancer types. Clusters with at least 4 proteins are 
color-coded, those with fewer are gray. 
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(b) 


Figure S5: Protein networks for (a) Lung adenocarcinoma (LUAD) and (b) Ovarian cys- 
tadenocarcinoma (OVCA) cancer types. Clusters with at least 4 proteins are color-coded, 
those with fewer are gray. 
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